Regular Expressions in R

Author

Martin Schweinberger

Introduction

This tutorial introduces regular expressions and how they can be used when working with language data. Regular expressions are powerful tools used to search and manipulate text patterns. They provide a way to find specific sequences of characters within larger bodies of text. Think of them as search patterns on steroids. Regular expressions are useful for tasks like extracting specific words, finding patterns, or replacing text in bulk. They offer a concise and flexible way to describe complex text patterns using symbols and special characters. Regular expressions have applications in linguistics and humanities research, aiding in tasks such as text analysis, corpus linguistics, and language processing. Understanding regular expressions can unlock new possibilities for exploring and analysing textual data.

This tutorial is aimed at beginner and intermediate users of R and showcases how to use regular expressions (also called wildcards) in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful functions and methods associated with regular expressions.

To be able to follow this tutorial, we suggest you check out and familiarise yourself with the content of the following R Basics tutorials:

Click here1 to download the entire R Notebook for this tutorial.

Click here to open an interactive Jupyter notebook that allows you to execute, change, and edit the code as well as upload your own data.


How can you search texts for complex patterns or combinations of patterns? This question will be answered in this tutorial and at the end you will be able to perform very complex searches yourself. The key concept of this tutorial is that of a regular expression. A regular expression (in short also called regex or regexp) is a special sequence of characters (or string) for describing a search pattern. You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids.

If you would like to get deeper into regular expressions, I can recommend Friedl (2006) and, in particular, chapter 17 of Peng (2020) for further study (although the latter uses base R rather than tidyverse functions, but this does not affect the utility of the discussion of regular expressions in any major or meaningful manner). Also, here is a so-called cheatsheet about regular expressions written by Ian Kopacka and provided by RStudio. Nick Thieberger has also recorded a very nice Introduction to Regular Expressions for humanities scholars, which is available on YouTube.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use R here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below can be executed without errors. Before turning to the code in the sections below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code. Don’t worry if it takes a while - it may take up to 5 minutes to install all of the libraries.

# set options
options(stringsAsFactors = F) # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install packages
install.packages("tidyverse")
install.packages("flextable")
install.packages("htmlwidgets")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

In the next step, we load the packages.

library(tidyverse)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed RStudio and have initiated the session by executing the code shown above, you are good to go.

Common Regex Patterns and Examples

We start with a very simple example, the sentence (string) The cat sat on the mat., which we use in the overview of regular expressions below.

text <- "The cat sat on the mat."

But before we delve into using regular expressions, we will have a look at common regular expression patterns that can be used in R and check what they stand for. Below is an overview of the different types of regular expressions that can be used in R.

1. Basic Characters

Basic Characters match specific characters in a string, including individual letters and wildcard characters like . that can match any character except a newline.

Pattern  Meaning                               Example
a        Matches the character 'a'             str_detect(text, "a") → TRUE
.        Matches any character except newline  str_detect(text, "c.t") → TRUE

2. Anchors

Anchors are used to match positions within a string, such as the start (^) or end ($), ensuring that patterns only match in specific locations.

Pattern  Meaning              Example
^        Start of the string  str_detect(text, "^The") → TRUE
$        End of the string    str_detect(text, "mat.$") → TRUE
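To make the effect of anchors concrete, here is a short sketch (using the example sentence from above) that contrasts an anchored with an unanchored match:

```r
# load stringr for str_detect()
library(stringr)

text <- "The cat sat on the mat."
str_detect(text, "^The")    # TRUE: the string starts with "The"
str_detect(text, "^cat")    # FALSE: "cat" occurs, but not at the start
str_detect(text, "mat\\.$") # TRUE: the escaped dot matches the literal full stop at the end
```

Note that the full stop is escaped (\\.) in the last pattern so that it matches a literal dot rather than any character.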

3. Character Classes

Character Classes define sets of characters that can match at a given position, including predefined ranges (e.g., [a-z] for lowercase letters) and negated classes ([^abc] to exclude certain characters).

Pattern  Meaning                                        Example
[abc]    Matches 'a', 'b', or 'c'                       str_detect(text, "[abc]") → TRUE
[^abc]   Matches any character except 'a', 'b', or 'c'  str_detect(text, "[^abc]") → TRUE
[a-z]    Matches any lowercase letter                   str_detect(text, "[a-z]") → TRUE
[A-Z]    Matches any uppercase letter                   str_detect(text, "[A-Z]") → TRUE
[0-9]    Matches any digit                              str_detect(text, "[0-9]") → FALSE

4. Quantifiers

Quantifiers specify how many times a character or group should appear: zero or more (*), one or more (+), zero or one (?), or an exact count or range ({n}, {n,}, {n,m}).

Pattern  Meaning                      Example
*        0 or more occurrences        str_detect(text, "a*") → TRUE
+        1 or more occurrences        str_detect(text, "a+") → TRUE
?        0 or 1 occurrence            str_detect(text, "a?") → TRUE
{n}      Exactly n occurrences        str_detect(text, "a{2}") → FALSE
{n,}     n or more occurrences        str_detect(text, "a{1,}") → TRUE
{n,m}    Between n and m occurrences  str_detect(text, "a{1,2}") → TRUE

5. Groups and Alternation

Groups and Alternation enable advanced pattern matching with capturing groups () to extract portions of a match and the | operator to specify alternative matches.

Pattern  Meaning          Example
()       Capturing group  str_replace_all(text, "(cat)", "dog") → "The dog sat on the mat."
|        OR operator      str_detect(text, "cat|dog") → TRUE
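Capturing groups become particularly useful when the captured material is reused in a replacement via a backreference (\\1). The following sketch combines a group, a character class, and alternation:

```r
# load stringr for str_replace_all() and str_count()
library(stringr)

text <- "The cat sat on the mat."
# the capturing group ([cm]) is reused in the replacement via the backreference \\1
str_replace_all(text, "([cm])at", "\\1og")  # "The cog sat on the mog."
# alternation: count matches of either alternative
str_count(text, "cat|mat")                  # 2
```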

6. Character Class Shortcuts

Character class shortcuts provide an efficient way to match groups of characters, such as digits, letters, punctuation, and spaces. These predefined classes help simplify regex patterns and improve readability when searching for specific types of characters in text data. The table below outlines commonly used character classes and their meanings.

RegEx Symbol/Sequence Explanation
[ab] lower case a and b
[a-z] all lower case characters from a to z
[AB] upper case A and B
[A-Z] all upper case characters from A to Z
[12] digits 1 and 2
[0-9] digits: 0 1 2 3 4 5 6 7 8 9
[:digit:] digits: 0 1 2 3 4 5 6 7 8 9
[:lower:] lower case characters: a–z
[:upper:] upper case characters: A–Z
[:alpha:] alphabetic characters: a–z and A–Z
[:alnum:] digits and alphabetic characters
[:punct:] punctuation characters: . , ; etc.
[:graph:] graphical characters: [:alnum:] and [:punct:]
[:blank:] blank characters: Space and tab
[:space:] space characters: Space, tab, newline, and other space characters
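The predefined classes from the table above are wrapped in an additional pair of square brackets when used in a pattern, e.g. [[:punct:]]. A short sketch:

```r
# load stringr for str_remove_all() and str_detect()
library(stringr)

text <- "The cat sat on the mat."
str_remove_all(text, "[[:punct:]]")  # "The cat sat on the mat" (full stop removed)
str_detect(text, "[[:upper:]]")      # TRUE: "T" is an upper case letter
str_detect(text, "[[:digit:]]")      # FALSE: the sentence contains no digits
```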

7. Special Escape Sequences

Special Escape Sequences provide shorthand for common patterns, such as \d for digits, \s for whitespace, and \w for word characters, improving regex readability and efficiency. The upper case variants (\D, \S, \W) represent the negations of the lower case variants, i.e. \W means non-word characters while \w refers to word characters (alphanumeric characters and the underscore).

Pattern  Meaning                              Example
\w       Word characters: [[:alnum:]_]        str_detect(text, "\\w") → TRUE
\W       Non-word characters: [^[:alnum:]_]   str_detect(text, "\\W") → TRUE
\s       Space characters: [[:space:]]        str_detect(text, "\\s") → TRUE
\S       Non-space characters: [^[:space:]]   str_detect(text, "\\S") → TRUE
\d       Digits: [[:digit:]]                  str_detect(text, "\\d") → FALSE
\D       Non-digits: [^[:digit:]]             str_detect(text, "\\D") → TRUE
\b       Word boundary                        str_detect(text, "\\bcat\\b") → TRUE
\B       Non-word boundary                    str_detect(text, "\\Bcat\\B") → FALSE
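Note that in R string literals the backslash itself must be escaped, so the regex \b is written as "\\b" in code. A short sketch:

```r
# load stringr for str_detect() and str_extract_all()
library(stringr)

text <- "The cat sat on the mat."
# the regex \b must be written as "\\b" in an R string literal,
# because the backslash itself needs escaping
str_detect(text, "\\bcat\\b")       # TRUE: "cat" occurs as a whole word
str_extract_all(text, "\\w+")[[1]]  # "The" "cat" "sat" "on" "the" "mat"
```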

Practice

To put regular expressions into practice, we need some text that we will perform our searches on. In this tutorial, we will use texts from Wikipedia about grammar.

# read in first text
text1 <- readLines("tutorials/regex/data/testcorpus/linguistics02.txt")
et <- paste(text1, sep = " ", collapse = " ")
# inspect example text
et
[1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

In addition, we will split the example text into words to have another resource we can use to understand regular expressions.

# split example text
set <- str_split(et, " ") %>%
    unlist()
# inspect
head(set)
[1] "Grammar" "is"      "a"       "system"  "of"      "rules"  

We now put regular expressions into practice and explore how to use them in functions from the stringr package, which are often used when processing text data.

Show all words in the split example text that contain a or n.

set[str_detect(set, "[an]")]
 [1] "Grammar"      "a"            "governs"      "production"   "and"         
 [6] "utterances"   "in"           "a"            "given"        "language."   
[11] "apply"        "sound"        "as"           "as"           "meaning,"    
[16] "and"          "include"      "componential" "as"           "pertaining"  
[21] "phonology"    "organisation" "phonetic"     "sound"        "formation"   
[26] "and"          "composition"  "and"          "syntax"       "formation"   
[31] "and"          "composition"  "phrases"      "and"          "sentences)." 
[36] "Many"         "modern"       "that"         "deal"         "principles"  
[41] "grammar"      "are"          "based"        "on"           "Noam"        
[46] "framework"    "generative"   "linguistics."

Show all words in the split example text that begin with a lower case a.

set[str_detect(set, "^a")]
 [1] "a"     "and"   "a"     "apply" "as"    "as"    "and"   "as"    "and"  
[10] "and"   "and"   "and"   "are"  

Show all words in the split example text that end in a lower case s.

set[str_detect(set, "s$")]
 [1] "is"         "rules"      "governs"    "utterances" "rules"     
 [6] "as"         "as"         "subsets"    "as"         "phrases"   
[11] "theories"   "principles" "Chomsky's" 

Show all words in the split example text in which there is an e, then any other character, and then an n.

set[str_detect(set, "e.n")]
[1] "governs"  "meaning," "modern"  

Show all words in the split example text in which there is an e, then two other characters, and then an n.

set[str_detect(set, "e.{2,2}n")]
[1] "utterances"

Show all words that consist of exactly three alphabetical characters in the split example text.

set[str_detect(set, "^[:alpha:]{3,3}$")]
 [1] "the" "and" "use" "and" "and" "and" "and" "and" "the" "are"

Show all words that consist of six or more alphabetical characters in the split example text.

set[str_detect(set, "^[:alpha:]{6,}$")]
 [1] "Grammar"      "system"       "governs"      "production"   "utterances"  
 [6] "include"      "componential" "subsets"      "pertaining"   "phonology"   
[11] "organisation" "phonetic"     "morphology"   "formation"    "composition" 
[16] "syntax"       "formation"    "composition"  "phrases"      "modern"      
[21] "theories"     "principles"   "grammar"      "framework"    "generative"  

Replace all lower case as with upper case Es in the example text.

str_replace_all(et, "a", "E")
[1] "GrEmmEr is E system of rules which governs the production End use of utterEnces in E given lEnguEge. These rules Epply to sound Es well Es meEning, End include componentiEl subsets of rules, such Es those pertEining to phonology (the orgEnisEtion of phonetic sound systems), morphology (the formEtion End composition of words), End syntEx (the formEtion End composition of phrEses End sentences). MEny modern theories thEt deEl with the principles of grEmmEr Ere bEsed on NoEm Chomsky's frEmework of generEtive linguistics."

Remove all non-alphabetical characters in the split example text.

str_remove_all(set, "\\W")
 [1] "Grammar"      "is"           "a"            "system"       "of"          
 [6] "rules"        "which"        "governs"      "the"          "production"  
[11] "and"          "use"          "of"           "utterances"   "in"          
[16] "a"            "given"        "language"     "These"        "rules"       
[21] "apply"        "to"           "sound"        "as"           "well"        
[26] "as"           "meaning"      "and"          "include"      "componential"
[31] "subsets"      "of"           "rules"        "such"         "as"          
[36] "those"        "pertaining"   "to"           "phonology"    "the"         
[41] "organisation" "of"           "phonetic"     "sound"        "systems"     
[46] "morphology"   "the"          "formation"    "and"          "composition" 
[51] "of"           "words"        "and"          "syntax"       "the"         
[56] "formation"    "and"          "composition"  "of"           "phrases"     
[61] "and"          "sentences"    "Many"         "modern"       "theories"    
[66] "that"         "deal"         "with"         "the"          "principles"  
[71] "of"           "grammar"      "are"          "based"        "on"          
[76] "Noam"         "Chomskys"     "framework"    "of"           "generative"  
[81] "linguistics" 

Remove all white spaces in the example text.

str_remove_all(et, " ")
[1] "Grammarisasystemofruleswhichgovernstheproductionanduseofutterancesinagivenlanguage.Theserulesapplytosoundaswellasmeaning,andincludecomponentialsubsetsofrules,suchasthosepertainingtophonology(theorganisationofphoneticsoundsystems),morphology(theformationandcompositionofwords),andsyntax(theformationandcompositionofphrasesandsentences).ManymoderntheoriesthatdealwiththeprinciplesofgrammararebasedonNoamChomsky'sframeworkofgenerativelinguistics."

Highlighting patterns

We use the str_view and str_view_all functions to show the occurrences of regular expressions in the example text.

To begin with, we match an exactly defined pattern (ang).

str_view_all(et, "ang")
[1] │ Grammar is a system of rules which governs the production and use of utterances in a given l<ang>uage. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.

Now, we include . which stands for any symbol (except a new line symbol).

str_view_all(et, ".n.")
[1] │ Grammar is a system of rules which gove<rns> the producti<on ><and> use of utter<anc>es <in >a giv<en >l<ang>uage. These rules apply to so<und> as well as me<ani>ng, <and> <inc>lude comp<one>ntial subsets of rules, such as those perta<ini>ng to ph<ono>logy (the org<ani>sati<on >of ph<one>tic so<und> systems), morphology (the formati<on ><and> compositi<on >of words), <and> s<ynt>ax (the formati<on ><and> compositi<on >of phrases <and> s<ent><enc>es). M<any> mode<rn >theories that deal with the pr<inc>iples of grammar are based <on >Noam Chomsky's framework of g<ene>rative l<ing>uistics.

EXERCISE TIME!


  1. What regular expression can you use to extract all forms of walk from a tokenised text?
Answer

text[stringr::str_detect(text, "[Ww][Aa][Ll][Kk].*")]

  2. What regular expression can you use to extract all words that start with “un” in a tokenised text?
Answer

text[stringr::str_detect(text, "\\b[Uu][Nn]\\w*")]

  3. What regular expression can you use to find all occurrences of numbers in a tokenised text?
Answer

text[stringr::str_detect(text, "\\b\\d+\\b")]

  4. What regular expression can you use to extract all words ending in “ing” from a tokenised text?
Answer

text[stringr::str_detect(text, "\\b\\w+[Ii][Nn][Gg]\\b")]

  5. What regular expression can you use to extract email addresses from a tokenised text?
Answer

text[stringr::str_detect(text, "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")]

  6. What regular expression can you use to identify words that contain at least one digit?
Answer

text[stringr::str_detect(text, "\\w*\\d\\w*")]

  7. What regular expression can you use to extract all words that contain a hyphen (e.g., “well-being”) from a tokenised text?
Answer

text[stringr::str_detect(text, "\\b\\w+-\\w+\\b")]

  8. What regular expression can you use to find all capitalized words (e.g., proper nouns) in a tokenised text?
Answer

text[stringr::str_detect(text, "\\b[A-Z][a-z]+\\b")]

  9. What regular expression can you use to extract all sentences that end with a question mark from a tokenised text?
Answer

text[stringr::str_detect(text, ".*\\?$")]

  10. What regular expression can you use to find all words that contain double vowels (e.g., “book”, “agree”) in a tokenised text?
Answer

text[stringr::str_detect(text, "\\b\\w*[aeiouAEIOU]{2}\\w*\\b")]
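Several of the answers above use character classes such as [Ww][Aa][Ll][Kk] to match regardless of case. As an alternative, stringr's regex() helper lets you switch off case sensitivity for the whole pattern. A short sketch, using a made-up example vector (words):

```r
# load stringr for str_detect() and regex()
library(stringr)

# a small made-up example vector
words <- c("Walk", "walked", "Walking", "talk")
# regex() with ignore_case = TRUE replaces classes such as [Ww][Aa][Ll][Kk]
words[str_detect(words, regex("walk", ignore_case = TRUE))]
# "Walk" "walked" "Walking"
```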


Regular expressions are a powerful tool for handling text data in R. Mastering them allows for efficient pattern matching, data extraction, and text transformation. Experiment with different patterns and functions from the stringr package to become proficient in regex for data science and text analysis in R.

Citation & Session Info

Schweinberger, Martin. 2025. Regular Expressions in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/regex/regex.html (Version 2025.04.01).

@manual{schweinberger2025regex,
  author = {Schweinberger, Martin},
  title = {Regular Expressions in R},
  note = {tutorials/regex/regex.html},
  year = {2025},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2025.04.01}
}
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Australia/Brisbane
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] flextable_0.9.7 lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1  
 [5] dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.1    
 [9] tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] generics_0.1.3          fontLiberation_0.1.0    renv_1.0.11            
 [4] xml2_1.3.6              stringi_1.8.4           hms_1.1.3              
 [7] digest_0.6.37           magrittr_2.0.3          evaluate_1.0.3         
[10] grid_4.4.2              timechange_0.3.0        fastmap_1.2.0          
[13] jsonlite_1.8.9          zip_2.3.1               scales_1.3.0           
[16] fontBitstreamVera_0.1.1 klippy_0.0.0.9500       textshaping_0.4.1      
[19] codetools_0.2-20        cli_3.6.3               rlang_1.1.5            
[22] fontquiver_0.2.1        munsell_0.5.1           withr_3.0.2            
[25] yaml_2.3.10             gdtools_0.4.1           tools_4.4.2            
[28] officer_0.6.7           uuid_1.2-1              tzdb_0.4.0             
[31] colorspace_2.1-1        assertthat_0.2.1        vctrs_0.6.5            
[34] R6_2.5.1                lifecycle_1.0.4         htmlwidgets_1.6.4      
[37] ragg_1.3.3              pkgconfig_2.0.3         pillar_1.10.1          
[40] gtable_0.3.6            glue_1.8.0              data.table_1.16.4      
[43] Rcpp_1.0.13-1           systemfonts_1.1.0       xfun_0.49              
[46] tidyselect_1.2.1        knitr_1.49              htmltools_0.5.8.1      
[49] rmarkdown_2.29          compiler_4.4.2          askpass_1.2.1          
[52] openssl_2.3.0          



References

Friedl, Jeffrey E. F. 2006. Mastering Regular Expressions. Sebastopol, CA: O’Reilly Media.
Peng, Roger D. 2020. R Programming for Data Science. Leanpub. https://bookdown.org/rdpeng/rprogdatascience/.

Footnotes

  1. If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.↩︎